DCP-NAS: Discrepant Child-Parent Neural Architecture Search for 1-Bit CNNs

Tr(·) represents the trace of a matrix. However, the term $\frac{\partial w^t}{\partial \alpha^t}$ in Eq. 4.43 is undefined and cannot be solved by the normal backpropagation process. To address this problem, we propose a decoupled optimization method as follows. In the following, we omit the superscript $\cdot^t$ and define $\tilde{L}$ as

\[
\tilde{L} = \left(\frac{\partial L(\alpha, w)}{\partial w}\right)^{T} / \alpha, \tag{4.44}
\]

which considers the coupling optimization problem as in Eq. 4.42. Note that R(·) is only

considered when backtracking. Thus, we have

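\[
\frac{\partial L(\alpha, w)}{\partial w} = \mathrm{Tr}\left[\alpha \tilde{L} \frac{\partial w}{\partial \alpha}\right]. \tag{4.45}
\]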

To simplify the derivation, we rewrite $\tilde{L}$ as $[\tilde{g}_1, \cdots, \tilde{g}_e, \cdots, \tilde{g}_E]$, where each $\tilde{g}_e$ is a column vector. Assuming that $w_m$ and $\alpha_{i,j}$ are independent when $m \neq j$, where $\alpha_{i,j}$ denotes a specific element of the matrix $\alpha$, we have

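\[
\left(\frac{\partial w}{\partial \alpha}\right)_m =
\begin{bmatrix}
0 & \cdots & \frac{\partial w_m}{\partial \alpha_{1,m}} & \cdots & 0 \\
\vdots & & \vdots & & \vdots \\
0 & \cdots & \frac{\partial w_m}{\partial \alpha_{e,m}} & \cdots & 0 \\
\vdots & & \vdots & & \vdots \\
0 & \cdots & \frac{\partial w_m}{\partial \alpha_{E,m}} & \cdots & 0
\end{bmatrix}_{E \times M} \tag{4.46}
\]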

and, rewriting $\alpha$ as a column vector $[\alpha_1, \cdots, \alpha_e, \cdots, \alpha_E]^T$ in which each $\alpha_e$ is a row vector, we have

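\[
\alpha \tilde{L} =
\begin{bmatrix}
\alpha_1 \tilde{g}_1 & \cdots & \alpha_1 \tilde{g}_e & \cdots & \alpha_1 \tilde{g}_E \\
\vdots & & \vdots & & \vdots \\
\alpha_e \tilde{g}_1 & \cdots & \alpha_e \tilde{g}_e & \cdots & \alpha_e \tilde{g}_E \\
\vdots & & \vdots & & \vdots \\
\alpha_E \tilde{g}_1 & \cdots & \alpha_E \tilde{g}_e & \cdots & \alpha_E \tilde{g}_E
\end{bmatrix}_{E \times E}. \tag{4.47}
\]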
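The block structure of Eq. 4.47 can be checked numerically. The following is a minimal NumPy sketch, not part of the original method: the sizes E and M, the random values, and the helper name `alpha_times_L` are all assumptions made for illustration. It stores $\alpha$ as an E × M array (row e is $\alpha_e$) and $\tilde{L}$ as an M × E array (column e is $\tilde{g}_e$), then confirms that entry (i, j) of the product is the inner product $\alpha_i \tilde{g}_j$.

```python
import numpy as np

def alpha_times_L(alpha, L_tilde):
    """Matrix product alpha @ L_tilde, an E x E array as in Eq. 4.47."""
    return alpha @ L_tilde

# Hypothetical sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
E, M = 3, 4
alpha = rng.normal(size=(E, M))    # row e plays the role of alpha_e
L_tilde = rng.normal(size=(M, E))  # column e plays the role of g_tilde_e

P = alpha_times_L(alpha, L_tilde)

# Rebuild the same matrix entry by entry from row/column inner products.
P_entrywise = np.array(
    [[alpha[i] @ L_tilde[:, j] for j in range(E)] for i in range(E)]
)
entries_match = np.allclose(P, P_entrywise)
```

Storing each $\tilde{g}_e$ as a column makes the product a single matrix multiplication, which is why the result has size E × E regardless of M.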

Combining Eq. 4.46 and Eq. 4.47, the matrix inside the trace term of Eq. 4.45 can be written as

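\[
\alpha \tilde{L} \left(\frac{\partial w}{\partial \alpha}\right)_m =
\begin{bmatrix}
0 & \cdots & \alpha_1 \sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}} & \cdots & 0 \\
\vdots & & \vdots & & \vdots \\
0 & \cdots & \alpha_e \sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}} & \cdots & 0 \\
\vdots & & \vdots & & \vdots \\
0 & \cdots & \alpha_E \sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}} & \cdots & 0
\end{bmatrix}_{E \times M}. \tag{4.48}
\]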
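As a sanity check on Eq. 4.48, the product of the E × E matrix $\alpha \tilde{L}$ with a matrix that is nonzero only in its m-th column should itself be nonzero only in that column, with row i equal to $\alpha_i \sum_{e} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}}$. The NumPy sketch below illustrates this under stated assumptions: the array `D` is a hypothetical stand-in for the partial derivatives $\frac{\partial w_m}{\partial \alpha_{e,m}}$ (which the text notes are not available from backpropagation), and the helper names and sizes are invented for the example.

```python
import numpy as np

def single_column(D, m):
    """Copy of D (E x M) with every column except column m set to zero (Eq. 4.46)."""
    D_m = np.zeros_like(D)
    D_m[:, m] = D[:, m]
    return D_m

def trace_matrix(alpha, L_tilde, D, m):
    """Left-hand side of Eq. 4.48: (alpha @ L_tilde) @ (dw/dalpha)_m."""
    return (alpha @ L_tilde) @ single_column(D, m)

# Hypothetical sizes and random stand-in values.
rng = np.random.default_rng(1)
E, M, m = 3, 4, 2
alpha = rng.normal(size=(E, M))    # row e is alpha_e
L_tilde = rng.normal(size=(M, E))  # column e is g_tilde_e
D = rng.normal(size=(E, M))        # D[e, m] stands in for dw_m / dalpha_{e,m}

out = trace_matrix(alpha, L_tilde, D, m)

# Right-hand side of Eq. 4.48 for column m:
# first sum_e g_tilde_e * D[e, m] (length M), then one inner product per row alpha_i.
weighted_g = L_tilde @ D[:, m]
expected_column = alpha @ weighted_g
```

Only column m of `out` is populated, which matches the sparsity pattern of the matrix in Eq. 4.48.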

Thus, the whole matrix $\alpha \tilde{L} \frac{\partial w}{\partial \alpha}$ has size $E \times M \times M$. After the above derivation, we compute the $e$-th component of the trace term in Eq. 4.45 as

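\[
\mathrm{Tr}\left[\alpha \tilde{L} \left(\frac{\partial w}{\partial \alpha}\right)\right]_e = \alpha_e \sum_{m=1}^{M} \sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}}. \tag{4.49}
\]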

Note that in the vanilla propagation process, $\alpha^{t+1} = \alpha^t - \eta_1 \frac{\partial L(\alpha^t)}{\partial \alpha^t}$. Combining this with Eq. 4.49, we have

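\[
\tilde{\alpha}^{t+1} = \alpha^{t+1} - \eta
\begin{bmatrix}
\sum_{m=1}^{M} \sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}} \\
\vdots \\
\sum_{m=1}^{M} \sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}} \\
\vdots \\
\sum_{m=1}^{M} \sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}}
\end{bmatrix}
\circledast
\begin{bmatrix}
\alpha_1 \\
\vdots \\
\alpha_e \\
\vdots \\
\alpha_E
\end{bmatrix}
= \alpha^{t+1} + \eta \psi^t \circledast \alpha^t, \tag{4.50}
\]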
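The two-stage update in Eq. 4.50 can be sketched in a few lines of NumPy. This is an illustrative sketch under several assumptions, not the authors' implementation: the function name `decoupled_update` is hypothetical, the correction term $\psi^t$ is taken as a given array with the same shape as $\alpha^t$ (this passage does not specify how to evaluate it), and the operator between $\psi^t$ and $\alpha^t$ is treated as an element-wise product.

```python
import numpy as np

def decoupled_update(alpha_t, grad_alpha, psi_t, eta1, eta):
    """One architecture-parameter step following the form of Eq. 4.50.

    alpha_t    : current architecture parameters (E x M)
    grad_alpha : dL/dalpha from ordinary backpropagation (E x M)
    psi_t      : correction term, assumed given and shaped like alpha_t
    eta1, eta  : step sizes for the vanilla step and the correction
    """
    # Vanilla step: alpha^{t+1} = alpha^t - eta1 * dL/dalpha^t.
    alpha_next = alpha_t - eta1 * grad_alpha
    # Correction: add eta * psi^t * alpha^t, taken element-wise (assumption).
    return alpha_next + eta * psi_t * alpha_t

# Tiny hand-checkable example with made-up numbers.
alpha_t = np.ones((2, 3))
grad_alpha = 0.5 * np.ones((2, 3))
psi_t = np.ones((2, 3))
alpha_tilde = decoupled_update(alpha_t, grad_alpha, psi_t, eta1=0.1, eta=0.01)
# Each entry: 1 - 0.1 * 0.5 + 0.01 * 1 * 1 = 0.96
```

Setting `psi_t` to zeros recovers the vanilla update, which makes the correction easy to switch off when comparing the two schemes.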